NVIDIA-596: Enable dpu healthcheck #2941

Open
tsorya wants to merge 2 commits into openshift:master from tsorya:jkary-dpu-health-check

Conversation

@tsorya
Contributor

@tsorya tsorya commented Mar 19, 2026

NVIDIA-596: Enable DPU healthcheck and bump Multus CNI to 1.1.0
Add configurable DPU node lease health monitoring to detect when the
DPU-side OVN-Kubernetes component is down or not installed. Without
this, pods are scheduled to DPU-accelerated nodes regardless of DPU
readiness, causing silent 2-minute CNI ADD timeouts with no visibility
or automated remediation.

DPU lease configuration:

  • Read dpu-node-lease-renew-interval and dpu-node-lease-duration from
    the hardware-offload-config ConfigMap (defaults: 10s / 40s).
  • Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION
    env vars into ovnkube-controller for dpu-host/dpu node modes.
  • Script-lib translates env vars into --dpu-node-lease-renew-interval
    and --dpu-node-lease-duration CLI flags for ovnkube-node.
  • Setting either value to 0 disables the health check; both are
    normalized to 0 when either is 0.
  • Lease namespace is derived via downward API (fieldRef).
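
The defaulting described in the first bullet can be sketched in Go as follows. This is a minimal illustration, not the operator's code: the helper name `leaseSeconds` and the plain `map[string]string` standing in for the ConfigMap's data are assumptions, and only the key names and the 10s / 40s defaults come from the description above.

```go
package main

import (
	"fmt"
	"strconv"
)

// Defaults taken from the PR description (10s renew interval, 40s duration).
const (
	defaultRenewIntervalSeconds = 10
	defaultDurationSeconds      = 40
)

// leaseSeconds returns the non-negative integer stored under key, or def when
// the key is missing or cannot be parsed. The helper name is hypothetical.
func leaseSeconds(data map[string]string, key string, def int) int {
	raw, ok := data[key]
	if !ok {
		return def
	}
	v, err := strconv.Atoi(raw)
	if err != nil || v < 0 {
		return def
	}
	return v
}

func main() {
	// Stand-in for the data of the hardware-offload-config ConfigMap.
	data := map[string]string{"dpu-node-lease-renew-interval": "5"}

	renew := leaseSeconds(data, "dpu-node-lease-renew-interval", defaultRenewIntervalSeconds)
	duration := leaseSeconds(data, "dpu-node-lease-duration", defaultDurationSeconds)
	fmt.Println(renew, duration) // prints: 5 40
}
```

An explicit "0" is preserved rather than replaced by the default, so the disable semantics in the list above remain expressible.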

Bump Multus CNI API version to 1.1.0:

Made-with: Cursor

Jira: https://issues.redhat.com/browse/NVIDIA-596

Summary by CodeRabbit

  • New Features

    • Added DPU node-lease configuration with configurable renew-interval and duration, applied conditionally for DPU node modes and exposed to node agent.
    • Added default lease values and validation to ensure sane renew/duration settings.
    • Updated Multus CNI spec to v1.1.0.
  • Tests

    • Added tests verifying lease env var rendering and behavior across deployment modes.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 19, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Mar 19, 2026

@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

NVIDIA-596: pass DPU lease config via env vars on dpu-host/dpu DaemonSets

Add configurable DPU node lease renew interval and duration as env vars on ovnkube-controller, gated to dpu-host/dpu modes. Script-lib builds CLI flags from env vars. Values read from hardware-offload-config ConfigMap with defaults 10s/40s. Setting either to 0 disables the health check. Lease namespace derived via fieldRef.

Jira: https://issues.redhat.com/browse/NVIDIA-596

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Mar 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Add DPU node lease configuration: new bootstrap fields and defaults, parse/validate ConfigMap keys, expose string values to templates, inject env vars into ovnkube DaemonSet for DPU modes, pass lease flags from node script to ovnkube, add ConfigMap defaults and tests.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Bootstrap & render (pkg/network/ovn_kubernetes.go, pkg/bootstrap/types.go) | Add DpuNodeLeaseRenewInterval and DpuNodeLeaseDuration fields and defaults; read/validate hardware-offload-config keys and expose stringified values to template render data. |
| Templates / Manifests (bindata/network/ovn-kubernetes/managed/ovnkube-node.yaml, bindata/network/ovn-kubernetes/self-hosted/ovnkube-node.yaml) | Conditionally inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION env vars into ovnkube-controller when .OVN_NODE_MODE is dpu-host or dpu and renew interval ≠ "0". |
| Node script (ovnkube CLI) (bindata/network/ovn-kubernetes/common/008-script-lib.yaml) | Introduce dpu_lease_flags, append --dpu-node-lease-renew-interval/--dpu-node-lease-duration when env vars set, and include ${dpu_lease_flags} in the ovnkube command args. |
| Default ConfigMap (hack/hardware-offload-config.yaml) | Add dpu-node-lease-renew-interval-in-seconds: "10" and dpu-node-lease-duration-in-seconds: "40" to ConfigMap data. |
| Tests & fixtures (pkg/network/.../kube_proxy_test.go, pkg/network/ovn_kubernetes_test.go, pkg/network/ovn_kubernetes_dpu_host_test.go) | Update fixtures to include new lease fields with defaults; add tests that render templates and assert presence/absence and exact values of lease env vars for full, dpu-host, and dpu modes. |
| Other manifest (bindata/network/multus/multus.yaml) | Bump embedded daemon-config.json cniVersion from "0.3.1" to "1.1.0". |

Sequence Diagram(s)

sequenceDiagram
    participant ConfigMap as hardware-offload ConfigMap
    participant Bootstrap as bootstrapOVNConfig
    participant Renderer as template renderer
    participant K8sAPI as Kubernetes API (DaemonSet)
    participant NodeScript as ovnkube node script
    participant ovnkube as ovnkube process

    ConfigMap->>Bootstrap: read dpu-node-lease-* keys
    Bootstrap->>Bootstrap: parse & validate values
    Bootstrap->>Renderer: provide DpuNodeLeaseRenewInterval/DpuNodeLeaseDuration (strings)
    Renderer->>K8sAPI: create/update DaemonSet with env vars (conditional on node mode)
    K8sAPI->>NodeScript: schedule/run ovnkube node script (on node)
    NodeScript->>NodeScript: build dpu_lease_flags from env vars
    NodeScript->>ovnkube: invoke ovnkube with ${dpu_lease_flags}

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 8 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Test code violates requirement 4 (Assertion Messages) and requirement 5 (Consistency with Codebase) due to multiple assertions without meaningful failure messages. | Add meaningful failure messages to all bare Gomega assertions in both TestOVNKubernetesLeaseEnvVars and TestDpuLeaseConfig tests to match established codebase patterns. |

✅ Passed checks (8 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'NVIDIA-596: Enable dpu healthcheck' clearly and specifically describes the main objective of this pull request: enabling DPU (Data Processing Unit) health check functionality, which is the primary purpose across all the file changes. |
| Stable And Deterministic Test Names | ✅ Passed | Two new Go test functions with static, deterministic names follow stable naming conventions. Test cases use static string names defined in struct literals with no dynamic data, ensuring consistency across runs. |
| Microshift Test Compatibility | ✅ Passed | PR adds only standard Go unit tests in pkg/network package using testing package, not Ginkgo e2e tests. Tests cover configuration rendering and environment variables, all compatible with MicroShift. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Tests added are standard Go unit tests using testing.T, not Ginkgo e2e tests. SNO compatibility check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR changes add only DPU lease configuration via environment variables with no scheduling constraints, affinity rules, replica logic, or topology-dependent features. |
| Ote Binary Stdout Contract | ✅ Passed | The PR changes do not violate the OTE Binary Stdout Contract. Modifications use klog.Warningf() for validation logging (writes to stderr), with no direct stdout writes in process-level code. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The new tests render Kubernetes DaemonSet templates and verify environment variables using only fake/mock clients with no external connectivity, confirming IPv6-only and disconnected CI environment compatibility. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@openshift-ci openshift-ci Bot requested review from jcaamano and pperiyasamy March 19, 2026 04:15
@tsorya tsorya marked this pull request as draft March 19, 2026 12:04
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2026
@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 62c31b1 to b5a3d66 Compare March 20, 2026 03:45
@tsorya tsorya marked this pull request as ready for review March 20, 2026 03:46
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 20, 2026
@openshift-ci openshift-ci Bot requested review from danwinship and pliurh March 20, 2026 03:46
@tsorya
Contributor Author

tsorya commented Mar 20, 2026

/retest-required

@tsorya
Contributor Author

tsorya commented Mar 20, 2026

Blocked by k8snetworkplumbingwg/multus-cni#1490

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 20, 2026
@yingwang-0320

@tsorya Could you please rebase this PR so that I can build an image and run some pre-merge testing?

@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 1eb0381 to 6b9ed3a Compare March 31, 2026 02:12
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@tsorya
Contributor Author

tsorya commented Mar 31, 2026

@tsorya Could you please rebase this PR so that I can build an image and run some pre-merge testing?

done

@yingwang-0320

/verified by pre-merge testing.
@tsorya I built an image with this PR and ran the CNO and multicast cases; all passed.
However, I can't build an image with both #2941 and #2944, because there is a conflict in the file bindata/network/ovn-kubernetes/common/008-script-lib.yaml

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 31, 2026
@openshift-ci-robot
Contributor

@yingwang-0320: This PR has been marked as verified by pre-merge testing.

Details

In response to this:

/verified by pre-merge testing.
@tsorya I built an image with this PR and ran the CNO and multicast cases; all passed.
However, I can't build an image with both #2941 and #2944, because there is a conflict in the file bindata/network/ovn-kubernetes/common/008-script-lib.yaml

Comment thread bindata/network/ovn-kubernetes/common/008-script-lib.yaml Outdated
Comment thread bindata/network/ovn-kubernetes/common/008-script-lib.yaml Outdated
Comment thread bindata/network/ovn-kubernetes/common/008-script-lib.yaml Outdated
Comment thread hack/hardware-offload-config.yaml
  daemon-config.json: |
    {
-     "cniVersion": "0.3.1",
+     "cniVersion": "1.1.0",
Contributor

why?

Contributor Author

Multus was updated to the new version, which enables CNI status.

Contributor Author

Updated PR description

@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label Apr 16, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 16, 2026

@tsorya: This pull request references NVIDIA-596 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

NVIDIA-596: pass DPU lease config via env vars on dpu-host/dpu DaemonSets

Add configurable DPU node lease renew interval and duration as env vars on ovnkube-controller, gated to dpu-host/dpu modes. Script-lib builds CLI flags from env vars. Values read from hardware-offload-config ConfigMap with defaults 10s/40s. Setting either to 0 disables the health check. Lease namespace derived via fieldRef.

Jira: https://issues.redhat.com/browse/NVIDIA-596

Summary by CodeRabbit

  • New Features

    • Added DPU node lease configuration support with customizable renewal intervals and durations for improved stability in hardware-accelerated networking environments
    • Updated Multus CNI plugin to support specification version 1.1.0
  • Tests

    • Added test coverage for DPU node lease environment variable configuration across different deployment modes

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/network/ovn_kubernetes.go`:
- Around line 1094-1102: If either ovnConfigResult.DpuNodeLeaseRenewInterval or
ovnConfigResult.DpuNodeLeaseDuration is 0 we should normalize both to 0 so the
disable semantics are consistent; update the logic around the current checks to
first detect if either field == 0 and set both
ovnConfigResult.DpuNodeLeaseRenewInterval = 0 and
ovnConfigResult.DpuNodeLeaseDuration = 0, otherwise keep the existing validation
that when both are non-zero and DpuNodeLeaseDuration <=
DpuNodeLeaseRenewInterval you log the warning and reset to
DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT and DPU_NODE_LEASE_DURATION_DEFAULT.
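
The rule suggested in this comment can be sketched as a small pure function. This is an illustrative reading of the suggestion only: the function name `normalizeLease` and the local constants are assumptions standing in for the fields and defaults named above, and the commit message later in this thread describes a different final rule (only renew-interval 0 disables, duration must stay > 0).

```go
package main

import "fmt"

// Local stand-ins for DPU_NODE_LEASE_RENEW_INTERVAL_DEFAULT and
// DPU_NODE_LEASE_DURATION_DEFAULT; the values follow the documented 10s / 40s.
const (
	defaultRenewInterval = 10
	defaultDuration      = 40
)

// normalizeLease applies the two steps from the review comment:
//  1. if either value is 0, treat the health check as disabled and zero both;
//  2. otherwise, if duration <= renew interval, fall back to the defaults.
// The function name is hypothetical.
func normalizeLease(renewInterval, duration int) (int, int) {
	if renewInterval == 0 || duration == 0 {
		return 0, 0
	}
	if duration <= renewInterval {
		return defaultRenewInterval, defaultDuration
	}
	return renewInterval, duration
}

func main() {
	r, d := normalizeLease(0, 40)
	fmt.Println(r, d) // prints: 0 0

	r, d = normalizeLease(30, 20)
	fmt.Println(r, d) // prints: 10 40

	r, d = normalizeLease(5, 15)
	fmt.Println(r, d) // prints: 5 15
}
```

Checking the zero case first keeps the disable path from being overwritten by the "duration must exceed renew interval" fallback.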
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 9e7b87eb-40e5-48db-92c5-608250f639d9

📥 Commits

Reviewing files that changed from the base of the PR and between 6b9ed3a and 7db199b.

📒 Files selected for processing (3)
  • bindata/network/ovn-kubernetes/common/008-script-lib.yaml
  • hack/hardware-offload-config.yaml
  • pkg/network/ovn_kubernetes.go
✅ Files skipped from review due to trivial changes (1)
  • hack/hardware-offload-config.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • bindata/network/ovn-kubernetes/common/008-script-lib.yaml

Comment thread pkg/network/ovn_kubernetes.go Outdated
@danwinship
Contributor

  • please squash the new commit back into the first commit.
  • and update the multus commit to have a commit message explaining why you changed it
  • look into what coderabbit said
  • if you want to undo the renaming ("-in-seconds") you can... I guess there's an argument to be made for being consistent with the ovn-k option name

@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 7db199b to 556c81f Compare April 16, 2026 23:38
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 20, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, tsorya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2026
@wizhaoredhat
Contributor

LGTM!

@tsorya
Contributor Author

tsorya commented Apr 27, 2026

/retest-required

@tsorya
Contributor Author

tsorya commented Apr 27, 2026

/test?

@tsorya
Contributor Author

tsorya commented Apr 28, 2026

/retest

@tsorya
Contributor Author

tsorya commented Apr 28, 2026

/payload 4.22 ci blocking
/payload 4.22 nightly blocking

@openshift-ci
Contributor

openshift-ci Bot commented Apr 28, 2026

@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aks
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/80e8be70-42fa-11f1-9152-f48b1625dff6-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/80e8be70-42fa-11f1-9152-f48b1625dff6-1

@tsorya
Contributor Author

tsorya commented Apr 28, 2026

/payload 4.22 ci blocking
/payload 4.22 nightly blocking

@openshift-ci
Contributor

openshift-ci Bot commented Apr 28, 2026

@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aks
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d32851d0-4347-11f1-8bbc-ea530d32dce6-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d32851d0-4347-11f1-8bbc-ea530d32dce6-1

@tsorya
Contributor Author

tsorya commented May 2, 2026

/payload 4.22 ci blocking
/payload 4.22 nightly blocking

@openshift-ci
Contributor

openshift-ci Bot commented May 2, 2026

@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aks
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/46a6e790-45d5-11f1-9a42-2e6691300348-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/46a6e790-45d5-11f1-9a42-2e6691300348-1

@tsorya
Contributor Author

tsorya commented May 2, 2026

/payload 4.22 ci blocking
/payload 4.22 nightly blocking

@openshift-ci
Contributor

openshift-ci Bot commented May 2, 2026

@tsorya: trigger 5 job(s) of type blocking for the ci release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aks
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8d512770-4650-11f1-9e5b-4564d59cf248-0

trigger 13 job(s) of type blocking for the nightly release of OCP 4.22

  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-upgrade-ovn-single-node
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-fips
  • periodic-ci-openshift-release-main-ci-4.22-e2e-azure-ovn-upgrade
  • periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade
  • periodic-ci-openshift-hypershift-release-4.22-periodics-e2e-aws-ovn-conformance
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-1of2
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-serial-2of2
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3
  • periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv6

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8d512770-4650-11f1-9e5b-4564d59cf248-1

@tsorya
Contributor Author

tsorya commented May 3, 2026

Latest run

periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ipi-ovn-ipv4

{ fail [k8s.io/kubernetes/test/e2e/apimachinery/crd_validation_rules.go:157]: expect error contains "failed rule", got "the server could not find the requested resource"}

aggregator-periodic-ci-openshift-release-main-ci-4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade

openshift-cluster-network-operator-2941-ci-4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-rt-upgrade

pod "network-check-target-5pbpt" is using the default service account

openshift-cluster-network-operator-2941-periodics-e2e-aws-ovn
FAIL: TestCreateCluster/Main/break-glass-credentials/independent_signers

@tsorya
Contributor Author

tsorya commented May 6, 2026

/verified by @tsorya

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label May 6, 2026
@openshift-ci-robot
Contributor

@tsorya: This PR has been marked as verified by @tsorya.

Details

In response to this:

/verified by @tsorya

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD b1101d1 and 2 for PR HEAD e8d6ace in total

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2026
@tsorya tsorya force-pushed the jkary-dpu-health-check branch from e8d6ace to 6401033 Compare May 7, 2026 01:02
@openshift-ci-robot openshift-ci-robot removed the verified Signifies that the PR passed pre-merge verification criteria label May 7, 2026
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label May 7, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 7, 2026
tsorya and others added 2 commits May 6, 2026 21:04
Add configurable DPU node lease health monitoring to detect when the
DPU-side OVN-Kubernetes component is down or not installed. Without
this, pods are scheduled to DPU-accelerated nodes regardless of DPU
readiness, causing silent 2-minute CNI ADD timeouts with no visibility
or automated remediation.

DPU lease configuration:
- Read dpu-node-lease-renew-interval and dpu-node-lease-duration from
  the hardware-offload-config ConfigMap (defaults: 10s / 40s).
- Inject OVNKUBE_NODE_LEASE_RENEW_INTERVAL and OVNKUBE_NODE_LEASE_DURATION
  env vars into ovnkube-controller for dpu-host/dpu node modes.
- Script-lib translates env vars into --dpu-node-lease-renew-interval
  and --dpu-node-lease-duration CLI flags for ovnkube-node.
- Setting renew-interval to 0 disables the health check; duration
  must always be > 0 (required by ovn-kubernetes).
- Lease namespace is derived via downward API (fieldRef).

Jira: https://issues.redhat.com/browse/NVIDIA-596
Made-with: Cursor
Signed-off-by: Igal Tsoiref <[email protected]>
Co-authored-by: Cursor <[email protected]>
Signed-off-by: Igal Tsoiref <[email protected]>
Co-authored-by: Cursor <[email protected]>
@tsorya tsorya force-pushed the jkary-dpu-health-check branch from 6401033 to a5d3f19 Compare May 7, 2026 01:04
@openshift-ci
Contributor

openshift-ci Bot commented May 7, 2026

@tsorya: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aws-ovn-upgrade | a5d3f19 | link | true | /test e2e-aws-ovn-upgrade |
| ci/prow/e2e-azure-ovn-upgrade | a5d3f19 | link | true | /test e2e-azure-ovn-upgrade |
| ci/prow/e2e-aws-ovn-upgrade-ipsec | a5d3f19 | link | true | /test e2e-aws-ovn-upgrade-ipsec |
| ci/prow/4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade | a5d3f19 | link | false | /test 4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade |
| ci/prow/4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade | a5d3f19 | link | false | /test 4.22-upgrade-from-stable-4.21-e2e-azure-ovn-upgrade |
| ci/prow/e2e-aws-ovn-serial-2of2 | a5d3f19 | link | true | /test e2e-aws-ovn-serial-2of2 |
| ci/prow/security | a5d3f19 | link | false | /test security |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.

5 participants